We’ll begin by reviewing concepts and labs from last week.
Then we’ll start talking about strategies for making causal claims in observational studies.
After today, you should be able to explain:
The key conceptual points from last week are:
02_lab_2_comments.Rmd
If treatment is not randomly assigned, then in general:
\[ Y_i(1),Y_i(0) \not\perp D_i \]
However, in some situations, it may be plausible to claim that, conditional on some variable(s) \(X\), the distribution of potential outcomes \(Y\) is the same across levels of treatment \(D\), that is, independent of treatment (conditional ignorability):
\[ Y_i(1),Y_i(0) \perp D_i |X_i \]
Recall that true randomization implies ignorability/independence between treatment and potential outcomes
\[ Y_i(1),Y_i(0) \perp D_i \]
Such that not only does
\[ E[Y(1)|D=1]=E[Y(1)]=E[Y(1)|D=0] \]
But also
\[ E[X_i|D=1]=E[X_i]=E[X_i|D=0] \]
In words, the distribution of pre-treatment (why?) covariates (both observed and unobserved) should be similar across treatment and control groups.
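To make the balance implication concrete, here is a minimal simulation sketch. Everything in it is hypothetical: the covariate `x`, the sample size, and the 50/50 assignment probability are illustrative choices, not values from the lab data. Because treatment is assigned without reference to `x`, the difference in covariate means should be close to zero.

```r
set.seed(123)
n <- 10000

# Hypothetical pre-treatment covariate (think: age in years)
x <- rnorm(n, mean = 40, sd = 10)

# Random assignment: the coin flip ignores x entirely
d <- rbinom(n, size = 1, prob = 0.5)

# Covariate means in the treated and control groups
mean_x_treated <- mean(x[d == 1])
mean_x_control <- mean(x[d == 0])

# Under randomization this difference should be small relative to sd(x)
diff_x <- mean_x_treated - mean_x_control
diff_x

# A t-test asks whether a gap this size could plausibly arise by chance
t.test(x ~ d)$p.value
```

Re-running with a different seed changes the exact numbers, but the difference should stay small compared with the standard deviation of `x`.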
While we can’t prove our assumption of as-if randomization:
\[ Y_i(1),Y_i(0) \perp D_i |X_i \]
We can test its observable implications.
Suppose we were interested in the effects of some job training program on future earnings (Y). Suppose younger people are more likely to have completed this program than older people.
Simply comparing outcomes between participants and non-participants would conflate the effects of the program with the effects of age (and of things correlated with age, like education).
So we might expect that levels of education between these groups would vary.
\[ E[Education|D=1]\neq E[Education|D=0] \]
But if age were the only thing that distinguished participants from non-participants, so that \(Y_i(1),Y_i(0) \perp D_i |X_i\) with \(X_i\) being age, we could estimate a conditional average treatment effect
\[ CATE= E[Y|D=1,Age]-E[Y|D=0,Age] \]
Further, to test our assumption that, conditional on age, job-training recipients are the same as non-recipients, we could look at the conditional distributions of other covariates, like education. If our assumption holds, we would expect that
\[ E[Education|D=1,Age] = E[Education|D=0,Age] \]
Or that the difference between these means is small enough that it could plausibly have arisen just by chance, which would make our claim that \(Y_i(1),Y_i(0) \perp D_i |X_i\) more credible.
We’d hope that this type of equality
\[ E[Education|D=1,Age]=E[Education|D=0,Age] \]
holds for all covariates. We can test it for observed covariates (things we call \(X\)), but may still worry about unobserved covariates (things we call \(U\)) like innate ability
\[ E[Ability|D=1,Age]\neq E[Ability|D=0,Age] \]
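The job-training logic can be sketched with simulated data. All names and numbers here are assumptions made for illustration: a binary `young` indicator stands in for age, take-up probabilities are 0.7 for the young and 0.3 for the old, and the true program effect on earnings is set to 5. Age is the only thing driving take-up, so conditional ignorability holds by construction.

```r
set.seed(42)
n <- 20000

# Hypothetical binary age group: 1 = young, 0 = old
young <- rbinom(n, 1, 0.5)

# Education (years) depends on age group only
educ <- 12 + 2 * young + rnorm(n)

# Younger people are more likely to complete the program;
# nothing other than age affects take-up
d <- rbinom(n, 1, ifelse(young == 1, 0.7, 0.3))

# Earnings: true program effect of 5, plus an effect of education
y <- 20 + 5 * d + 3 * educ + rnorm(n, sd = 2)

# Naive comparison mixes the program effect with age and education
naive <- mean(y[d == 1]) - mean(y[d == 0])

# Conditional comparisons within age groups
cate_young <- mean(y[d == 1 & young == 1]) - mean(y[d == 0 & young == 1])
cate_old   <- mean(y[d == 1 & young == 0]) - mean(y[d == 0 & young == 0])

# Balance check on education: unconditional, then within the young
educ_gap_raw   <- mean(educ[d == 1]) - mean(educ[d == 0])
educ_gap_young <- mean(educ[d == 1 & young == 1]) -
  mean(educ[d == 0 & young == 1])

round(c(naive = naive, cate_young = cate_young, cate_old = cate_old,
        educ_gap_raw = educ_gap_raw, educ_gap_young = educ_gap_young), 2)
```

In this setup the naive difference overstates the program effect, while the within-age comparisons land near the true value and the education gap shrinks toward zero once we condition on age. An unobserved covariate such as ability could not be checked this way, which is why the assumption remains an assumption.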
You have already calculated conditional means, by combining the mean() function with logical indexes
# Load data from last lab
load("fulldata.rda")
# Subset data
df <- df[!is.na(df$treatment_group), ]
# Unconditional mean
mean(df$therm_trans_t1, na.rm = TRUE)
## [1] 56.7972
# Conditional mean for those who received the intervention
# E(therm_trans_t1 | treatment_group = "Trans-Equality")
mean(df$therm_trans_t1[df$treatment_group == "Trans-Equality"], na.rm = TRUE)
## [1] 60.18627
# Conditional means by treatment and age
# E(therm_trans_t1 | treatment_group = "Trans-Equality", age > 30)
# Olds (age > 30)
mean(df$therm_trans_t1[df$treatment_group == "Trans-Equality" & df$vf_age > 30], na.rm = TRUE)
## [1] 58.73529
mean(df$therm_trans_t1[df$treatment_group == "Recycling" & df$vf_age > 30], na.rm = TRUE)
## [1] 52.80791
# Youths (age <= 30)
mean(df$therm_trans_t1[df$treatment_group == "Trans-Equality" & df$vf_age <= 30], na.rm = TRUE)
## [1] 67.44118
mean(df$therm_trans_t1[df$treatment_group == "Recycling" & df$vf_age <= 30], na.rm = TRUE)
## [1] 57.10417
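When there are several groups, writing one `mean()` call per cell gets repetitive. One possible shortcut is base R's `tapply()`, which computes all the conditional means in a single call. The sketch below uses a small toy data frame with made-up values; only the column names are borrowed from the lab data, and `age_group` is a helper column introduced here for illustration.

```r
# Toy data with made-up values (NOT the lab data); column names mirror it
toy <- data.frame(
  treatment_group = rep(c("Trans-Equality", "Recycling"), each = 4),
  vf_age          = c(25, 29, 45, 33, 28, 22, 60, 41),
  therm_trans_t1  = c(70, NA, 55, 65, 60, 50, 40, 50)
)

# Hypothetical helper column splitting respondents at age 30
toy$age_group <- ifelse(toy$vf_age > 30, "Over 30", "30 and under")

# Mean thermometer rating for every treatment-by-age cell
cond_means <- tapply(
  toy$therm_trans_t1,
  list(toy$treatment_group, toy$age_group),
  mean,
  na.rm = TRUE
)
cond_means

# Same cell computed with logical indexing, for comparison
by_index <- mean(
  toy$therm_trans_t1[toy$treatment_group == "Trans-Equality" & toy$vf_age > 30],
  na.rm = TRUE
)
by_index
```

Both approaches give the same number for any one cell; `tapply()` simply returns the whole table at once, with treatment groups in rows and age groups in columns.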
Directed acyclic graphs are a way of representing the causal relationships between variables
Let’s consider a simple example examining the relationship between ice cream sales and crime
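One way to see what such a graph implies is to simulate from it. The sketch below assumes a particular structure that is not spelled out above: a common cause, here called `temp` (temperature), raises both ice cream sales and crime, and ice cream has no effect on crime at all. All coefficients are arbitrary illustrative values.

```r
set.seed(2024)
n <- 5000

# Assumed DAG: temp -> ice_cream, temp -> crime (no arrow ice_cream -> crime)
temp      <- rnorm(n, mean = 20, sd = 8)
ice_cream <- 50 + 3 * temp + rnorm(n, sd = 10)
crime     <- 10 + 0.5 * temp + rnorm(n, sd = 5)

# Ice cream and crime are correlated even though neither causes the other
raw_cor <- cor(ice_cream, crime)

# Regression slope on ice cream without adjusting for the common cause
b_naive <- coef(lm(crime ~ ice_cream))["ice_cream"]

# Adjusting for the common cause pushes the slope toward zero
b_adj <- coef(lm(crime ~ ice_cream + temp))["ice_cream"]

round(c(raw_cor = raw_cor, b_naive = unname(b_naive), b_adj = unname(b_adj)), 3)
```

Under this assumed graph, the raw association comes entirely from the shared dependence on temperature, so conditioning on `temp` removes it. A different graph, for example one with an arrow from ice cream to crime, would imply a different pattern.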
The GQ approach
What do we mean when we say something is model dependent?